Exploratory Study of Child Blood Lead Levels Nationally and in Philadelphia

BMIN503/EPID600 Final Project

Author

Ashley Trocle


1 Overview

This project aims to explore trends in childhood lead exposures. To understand this issue, data will be summarized and analyzed at national and local (Philadelphia) levels, utilizing the National Health and Nutrition Examination Survey (NHANES), the American Community Survey (ASC), the EPA’s Environmental Justice Indices, and OpenDataPhilly.Throughout this project, I consulted with three Penn Professors: Dr Cheryl Bettigole, the former health commissioner of Philadelphia, Dr Amin Chen, an environmental epidemiologist, and Dr John Holmes, a social epidemiologist. They provided insights on the availability and interpretation of data, important historical contexts, and key policy knowledge.

Materials for this project can be found here: GitHub Repository

2 Introduction

2.1 The Problem

Exposure to elevated lead levels during childhood is a critical environmental and public health issue. Lead is a neuro-toxin that impacts neurological and behavioral development in children. Measured most commonly via blood lead levels (BLL), there is no safe level of lead exposure in children. Even low levels of lead exposure during developmental stages can lead to lifelong adverse effects on cognitive function and attention.

A primary source of exposure for young children is deteriorating lead based paint, which is still present in many homes despite a national ban in 1978. Children commonly ingest paint chips or paint dust. Both acute and chronic exposure to lead can be dangerous. Elevated blood lead levels continue to disproportionately affect marginalized communities, especially low-income and minority groups who often live in older, non-remidiated housing. Addressing lead exposure is key to addressing health disparities nationally and in our communities.

2.2 The Policies

A number of federal, state and local policies have dramatically reduces the rates of lead exposures in the past 50 years. Understanding these policies is key to analyzing current and historical trends in the data.

National:

  • 1974: Safe Drinking Water Act
  • 1978: Lead based paints were banned in residences
  • 1999: Lead Safe Housing Rule
  • 2021: Biden-Harris Lead Pipe and Paint Action Plan

Local:

  • 2011: Philadelphia Lead Paint Disclosure Law
  • 2019: Expansion of the 2011 law to include all rental properties, not only those with children

2.3 This Project

Using data from the NHANES, ASC, OpenDataPhilly, this project sought to understand the relationship between lead exposure and various demographic, socioeconomic and neighborhood characteristics.

This project takes an interdisciplinary approach to better understand lead exposure issues at national, state and local levels. The project integrates social epidemiology, environmental health, public health and policy analysis to address the complex nature of issues that contribute to lead exposure and the significant impact it can have on children.

3 Methods

The study uses four datasets to understand lead exposure in children, providing both broad and detailed overviews of the issue. A detailed description of these datasets is provided below.

After loading the necessary packages, the data is uploaded and cleaned below.

3.1 Loading Packages

#Load Packages
library(nhanesA)
library(tidyverse)
library(haven)
library(dplyr)
library(ggplot2)
library(survey)
library(sf)
library(tidycensus)
library(ggspatial)
library(leaflet)
library(RColorBrewer)
library(broom)
library(tigris)
library(tableone)
library(survey)
options(tigris_use_cache = TRUE)
options(progress_enabled = FALSE)

3.2 Loading and Cleaning Data

NHANES: NHANES data was initially pulled using the nhanesA package prior to the package being removed. The code below will be used a reference or will be used in the future if the package is made available again.

NHANES data was loaded from two time periods 2015-2016 and 2021-2023 (the most recent survey available). This survey is used to assess the health of American children. Demographic, income, and laboratory (BLL) data was used in this study. Based on 2021 criteria from the CDC, we consider BLLs over 3.5 ug/dl as elevated.

#import NHANES_I and NHANES_L (2015-2016 and 2021-2023): Demographics, Income, and Phlebotomy modules
varlist <- c("DEMO", "PBCD", "INQ") 
varlist_years <- paste0(rep(varlist, each = 2), c("_I", "_L"))

#Load everything from varlist_years in nhanes as a list)
list_all <- sapply(varlist_years, function(x) {data.frame(nhanes(x))}) 

#Create a data.frame for each module
for(i in 1:length(list_all)) {
  assign(names(list_all)[i], list_all[[i]])
} 

#Combine modules from each year into one larger data.frame
for (i in 1:length(varlist)) {
  assign(varlist[i], plyr::rbind.fill(mget(grep(varlist[i], ls(), value = T))))
} 
rm(list = grep("_[IL]", ls(), value = T))

#Create a single data.frame that combines all modules 
nhanes.data <- full_join(get(varlist[1]), get(varlist[2]), by = "SEQN")

#Create a single data.frame that combines all modules 
for (i in 3:length(varlist)){
  nhanes.data <- full_join(nhanes.data, get(varlist[i]), by = "SEQN")
} 

rm(list = ls()[-which(ls() == "nhanes.data")])

names(nhanes.data) #confirm loaded correctly

#Limit the dataset to the necessary variables and limit to children under 5
nhanes.data.sub <- nhanes.data |>
  select( id = SEQN, survey = SDDSRVYR, gender = RIAGENDR, age = RIDAGEYR, race3 = RIDRETH3, birth_country= DMDBORN4, timeUS = DMDYRUSR, income_poverty = INDFMPIR,  bll=LBXBPB, blldetected =LBDBPBLC, poverty_level= INDFMMPC, IP_ratio= INDFMPIR, cash= IND310, interview_weight=WTINT2YR, MEC_weight =WTMEC2YR, samplingunit = SDMVPSU,strata = SDMVSTRA) |>
   mutate(gender = factor(case_when(
      gender == "Male" ~ 0,
      gender == "Female" ~ 1),
    levels = c(0, 1), 
    labels = c("Male", "Female")))|>
   mutate(survey = factor(case_when(
      survey == 9 ~ 0,
      survey == 12 ~ 1
    ), levels = c(0, 1), 
    labels = c("2015-2016", "2021-2023")))|>
  filter(age <10)|>
  mutate(agecat = case_when(
      age < 2 ~ "1-<2",
      age >= 2 & age < 3 ~ "2-<3",
      age >= 3 & age < 4 ~ "3-<4",
      age >= 4 & age < 5 ~ "4-<5",
      age >= 5 & age < 6 ~ "5-<6",
      age >= 7 & age < 8 ~ "7-<8",
      age >= 8 & age < 9 ~ "8-<9",
      age >= 9 & age < 10 ~ "9-<10")
  )

#planning to relevel or mutate the other variables
head(nhanes.data.sub)

#Weighting data (still working on how to use this in the Results)
nhanes.data.sub$adjusted_weight <- nhanes.data.sub$MEC_weight / 2

nhanes_design <- svydesign(
  id = ~samplingunit,  
  strata = ~strata,  
  weights = ~adjusted_weight, 
  data = nhanes.data.sub,
  nest = TRUE
)

summary(nhanes_design)

#Create a Dataset with BLL values for those < 18 years of age
nhanes.data.bll <-nhanes.data.sub|>
  filter (!is.na(bll)) |>
  mutate( elevated.bll = case_when(
       bll < 0.5 ~ "not elevated",
       bll >= 0.5 ~ "elevated"),
       elevated.bll = factor(elevated.bll, 
                             levels = c("not elevated", "elevated")))
    
head(nhanes.data.bll)
table(nhanes.data.bll$survey, nhanes.data.bll$elevated.bll)

NHANES (data from website): Following removal of the nhanesA package, the data was pulled and loaded from files on the NHANES website. This data is also included in the repository.

#Load the indivudal files downloaded from the NHANES sites 
#import NHANES_I and NHANES_L (2015-2016 and 2021-2023): Demographics, Income, and Phlebotomy modules
files <- c("DEMO_I.xpt", "INQ_I.xpt", "PBCD_I.xpt",
           "DEMO_L.xpt", "INQ_L.xpt", "PBCD_L.xpt")

# Create a list
data_list <- lapply(files, read_xpt)
names(data_list) <- files

# Combine datasets for each year using purrr::reduce with full_join
combine_year <- function(pattern) {
  files_subset <- names(data_list)[grep(pattern, names(data_list))]
  reduce(data_list[files_subset], full_join, by = NULL)}

# Combine datasets for "_I" and "_L"
data_I <- combine_year("_I")
Joining with `by = join_by(SEQN)`
Joining with `by = join_by(SEQN)`
data_L <- combine_year("_L")
Joining with `by = join_by(SEQN)`
Joining with `by = join_by(SEQN)`
# Combine data 
nhanes.data <- bind_rows(data_I, data_L)

#Limit the dataset to the necessary variables and limit to children under 5
nhanes.data.sub <- nhanes.data |>
  select( id = SEQN, survey = SDDSRVYR, gender = RIAGENDR, age = RIDAGEYR, race = RIDRETH3, hh_education = DMDHREDU,  bll=LBXBPB, blldetected =LBDBPBLC, poverty_level= INDFMMPC, assets = IND310, interview_weight=WTINT2YR, lab_weight = WTPH2YR, MEC_weight =WTMEC2YR, samplingunit = SDMVPSU,strata = SDMVSTRA) |> 
        mutate(gender = factor(ifelse(gender == 1, 0, 1), 
                             levels = c(0, 1), 
                             labels = c("Male", "Female")))|>
        mutate(survey = factor(case_when(
            survey == 9 ~ 0,
            survey == 12 ~ 1
          ), levels = c(0, 1), 
          labels = c("2015-2016", "2021-2023")))|>
        filter(age <10)|> #limit the data to children under 10 who are most at risk of developing more severe outcome long term. 
        mutate(agecat = case_when(
            age < 2 ~ "1-<2",
            age >= 2 & age < 3 ~ "2-<3",
            age >= 3 & age < 4 ~ "3-<4",
            age >= 4 & age < 5 ~ "4-<5",
            age >= 5 & age < 6 ~ "5-<6",
            age >= 7 & age < 8 ~ "7-<8",
            age >= 8 & age < 9 ~ "8-<9",
            age >= 9 & age < 10 ~ "9-<10"))|>
        mutate(race = factor(race, 
                     levels = c(1, 2, 3, 4, 6, 7), 
                     labels = c("Mexican American", 
                                "Other Hispanic", 
                                "Non-Hispanic White", 
                                "Non-Hispanic Black", 
                                "Non-Hispanic Asian", 
                                "Other Race - Inc. Multi-Racial")))|>
        mutate(hh_education = factor(ifelse(hh_education == ".", NA, hh_education),
                             levels = c(1, 2, 3), 
                             labels = c("Less than high school degree", 
                                        "High school/ some college", 
                                        "College graduate or above")))|>
        mutate(poverty_level = factor(ifelse(poverty_level %in% 
                                           c(7, 9, "."), NA, poverty_level),
                              levels = c(1, 2, 3),
                              labels = c("< 1.30", 
                                         "1.31 - 1.85", 
                                         " > 1.85")))|> 
        mutate(assets = factor(ifelse(assets %in% c(77, 99, "."), NA, assets),
                       levels = c(1, 2, 3, 4, 5),
                       labels = c("Less than $3000", 
                                  "$3001-$5000", 
                                  "$5001-$10000", 
                                  "$10001-$15000", 
                                  "$15001-$20000")))

head(nhanes.data.sub) #review data cleaning
# A tibble: 6 × 16
     id survey   gender   age race  hh_education   bll blldetected poverty_level
  <dbl> <fct>    <fct>  <dbl> <fct> <fct>        <dbl>       <dbl> <fct>        
1 83739 2015-20… Male       4 Non-… <NA>          0.29           0 " > 1.85"    
2 83740 2015-20… Male       1 Othe… <NA>         NA             NA  <NA>        
3 83746 2015-20… Female     4 Non-… <NA>         NA             NA " > 1.85"    
4 83748 2015-20… Male       3 Non-… <NA>         NA             NA "< 1.30"     
5 83760 2015-20… Female     3 Non-… College gra…  0.44           0 "< 1.30"     
6 83763 2015-20… Female     2 Othe… <NA>         NA             NA " > 1.85"    
# ℹ 7 more variables: assets <fct>, interview_weight <dbl>, lab_weight <dbl>,
#   MEC_weight <dbl>, samplingunit <dbl>, strata <dbl>, agecat <chr>
#Create a Dataset with BLL values as elevated of not elevated
nhanes.data.bll <-nhanes.data.sub|>
  filter (blldetected == 0)|>
  mutate( elevated.bll = case_when(
       bll < 3.5 ~ "not elevated",
       bll >= 3.5 ~ "elevated"),
       elevated.bll = factor(elevated.bll, 
                             levels = c("not elevated", "elevated")))

head(nhanes.data.bll)
# A tibble: 6 × 17
     id survey   gender   age race  hh_education   bll blldetected poverty_level
  <dbl> <fct>    <fct>  <dbl> <fct> <fct>        <dbl>       <dbl> <fct>        
1 83739 2015-20… Male       4 Non-… <NA>          0.29           0 " > 1.85"    
2 83760 2015-20… Female     3 Non-… College gra…  0.44           0 "< 1.30"     
3 83772 2015-20… Female     2 Othe… <NA>          0.55           0 "< 1.30"     
4 83780 2015-20… Male       4 Mexi… <NA>          0.96           0 " > 1.85"    
5 83792 2015-20… Female     3 Mexi… College gra…  1.61           0 " > 1.85"    
6 83797 2015-20… Female     1 Othe… <NA>          0.49           0 " > 1.85"    
# ℹ 8 more variables: assets <fct>, interview_weight <dbl>, lab_weight <dbl>,
#   MEC_weight <dbl>, samplingunit <dbl>, strata <dbl>, agecat <chr>,
#   elevated.bll <fct>
table(nhanes.data.bll$survey, nhanes.data.bll$elevated.bll)
           
            not elevated elevated
  2015-2016         1443       28
  2021-2023          645        4
#Weighting data (using 2 sets of data based on NHANES instructions)
nhanes.data.bll$adjusted_weight <- nhanes.data.bll$MEC_weight / 2

nhanes.data.bll <- nhanes.data.bll |>
  filter(!is.na(adjusted_weight))

nhanes_design <- svydesign(
  id = ~samplingunit,  
  strata = ~strata,  
  weights = ~adjusted_weight, 
  data = nhanes.data.bll,
  nest = TRUE
)

summary(nhanes_design) #review the weighted data 
Stratified 1 - level Cluster Sampling design (with replacement)
With (60) clusters.
svydesign(id = ~samplingunit, strata = ~strata, weights = ~adjusted_weight, 
    data = nhanes.data.bll, nest = TRUE)
Probabilities:
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
1.821e-05 7.797e-05 1.233e-04 1.277e-04 1.701e-04 3.764e-04 
Stratum Sizes: 
           119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 173 174
obs         50 142  85  81  91  69  75 143  86 115 110 134 135 114  41  40  38
design.PSU   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
actual.PSU   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
           175 176 177 178 179 180 181 182 183 184 185 186 187
obs         46  41  40  41  54  36  47  53  37  42  37  33  64
design.PSU   2   2   2   2   2   2   2   2   2   2   2   2   2
actual.PSU   2   2   2   2   2   2   2   2   2   2   2   2   2
Data variables:
 [1] "id"               "survey"           "gender"           "age"             
 [5] "race"             "hh_education"     "bll"              "blldetected"     
 [9] "poverty_level"    "assets"           "interview_weight" "lab_weight"      
[13] "MEC_weight"       "samplingunit"     "strata"           "agecat"          
[17] "elevated.bll"     "adjusted_weight" 

ACS Data: The ACS data was loaded. The ACS collects a wide variety of data from Americans to inform policy. For this study, housing related data from 2015 was collected to assess home ownership and the year a home was built. These are both important considerations given that many policies focus on lead remediation in rental units and properties build before 1978, when lead paint was banned in residential homes. A census API key was previously loaded and is not included in this document. The data was then cleaned and appropriate housing related variables were identified.

#revising which variables collecting from the ACS data to use in the analysis -> will use this for mapping and regression

#National ACS Data by County
acs.data.national<- get_acs(geography= "county",
                   year= 2015,
                   variables = c("B25035_001E", #median year home built
                                 "B25036_001E", #total occupied 
                                 "B25036_002E", #owner occupied
                                 "B25036_013E", #renter occupied
                                 "B25034_001E", #total homes
                                 "B25034_002E", #Built 2020 or later
                                 "B25034_003E", #Built 2010 to 2019
                                 "B25034_004E", #Built 2000 to 2009
                                 "B25034_005E", #Built 1990 to 1999
                                 "B25034_006E", #Built 1980 to 1989
                                 "B25034_007E", #Built 1970 to 1979
                                "B25034_008E", #Built 1960 to 1969
                                "B25034_009E",  #Built 1950 to 1959
                                "B25034_010E",  #Built 1940 to 1949
                                "B25034_011E"), #Built 1939 or earlier
               output = "wide")
Getting data from the 2011-2015 5-year ACS
# Transform the data to create percentages by year and ownership
acs.data.national <- acs.data.national |>
  mutate(Built_after_1980 = B25034_002E + 
      B25034_003E + B25034_004E + B25034_005E + B25034_006E,
    Built_before_1979 = B25034_007E + 
      B25034_008E + B25034_009E + B25034_010E + B25034_011E,
    Percent_after_1980 = (Built_after_1980 / B25034_001E) * 100,
    Percent_before_1979 = (Built_before_1979 / B25034_001E) * 100)|>
  mutate(
    Percent_Owner_Occupied = (B25036_002E / B25036_001E) * 100,
    Percent_Renter_Occupied = (B25036_013E / B25036_001E) * 100)

#To visualize this data, load county polygon data
us.counties1 <- counties(year=2015, cb = TRUE, resolution = "500k") 
us.counties <- us.counties1 |>
  filter(!STATEFP %in% c("02", "15", "60", 
                            "66", "69", 
                            "72", "78"))

#Combine the county polygons with ACS national data
us.counties <- us.counties |> 
  left_join(acs.data.national, by = "GEOID")

#Philadelphia ACS Data by Tract
acs.data.phl<- get_acs(geography= "tract",
                   year= 2015,
                   variables = c("B25035_001E", #median year home built
                                 "B25036_001E", #total occupied 
                                 "B25036_002E", #owner occupied
                                 "B25036_013E", #renter occupied
                                 "B25034_001E", #total homes
                                 "B25034_002E", #Built 2020 or later
                                 "B25034_003E", #Built 2010 to 2019
                                 "B25034_004E", #Built 2000 to 2009
                                 "B25034_005E", #Built 1990 to 1999
                                 "B25034_006E", #Built 1980 to 1989
                                 "B25034_007E", #Built 1970 to 1979
                                "B25034_008E", #Built 1960 to 1969
                                "B25034_009E",  #Built 1950 to 1959
                                "B25034_010E",  #Built 1940 to 1949
                                "B25034_011E"), #Built 1939 or earlier
                              
                  output = "wide",
                  state="PA",
                  county="Philadelphia")
Getting data from the 2011-2015 5-year ACS
# Transform the data to create percentages by year and ownership
acs.data.phl<- acs.data.phl |>
  mutate(
    Built_after_1980 = B25034_002E + 
      B25034_003E + B25034_004E + B25034_005E + B25034_006E,
    Built_before_1979 = B25034_007E + 
      B25034_008E + B25034_009E + B25034_010E + B25034_011E,
    Percent_after_1980 = (Built_after_1980 / B25034_001E) * 100,
    Percent_before_1979 = (Built_before_1979 / B25034_001E) * 100)|>
  mutate(
    Percent_Owner_Occupied = (B25036_002E / B25036_001E) * 100,
    Percent_Renter_Occupied = (B25036_013E / B25036_001E) * 100)

OpenDataPhilly Lead Data: The data set includes the number of children screened for elevated blood lead levels (BLL) defined as >5ug/dL (based on the CDC’s criteria in 2012, incidence of children with elevated BLL, and percent screened with elevated BLL by census tract from 2013-2015. Values are missing where there are less than 6 observations for confidentiality purposes.

#OpenDataPhilly Data 
phllead <- read.csv("C:\\Users\\atrocle\\Documents\\.EPID 6000 Data Science\\Assignments\\Final Project\\opendataphl_lead_tract.csv", header = TRUE)
head(phllead) #confirm loading properly
  census_tract data_redacted num_bll_5plus num_screen perc_5plus
1   4.2101e+10         false             0        100          0
2   4.2101e+10          true            NA        109         NA
3   4.2101e+10          true            NA        110         NA
4   4.2101e+10          true            NA         61         NA
5   4.2101e+10         false             0         41          0
6   4.2101e+10          true            NA         49         NA
#Load Tract Polygon data
philly.tracts <- read_rds("https://raw.githubusercontent.com/HimesGroup/BMIN503/master/DataFiles/philly.tracts.2020.rds")
names(philly.tracts)
 [1] "STATEFP"  "COUNTYFP" "TRACTCE"  "GEOID"    "NAME"     "NAMELSAD"
 [7] "MTFCC"    "FUNCSTAT" "ALAND"    "AWATER"   "INTPTLAT" "INTPTLON"
[13] "geometry"
#Combine spatial with lead data
phllead <- phllead |>
  rename(GEOID = census_tract) |>
  mutate(GEOID = as.character(GEOID))

phllead_geo <- philly.tracts |>
  left_join(phllead, by = "GEOID")  

phllead_geo <- phllead_geo |>
  left_join(acs.data.phl, by = "GEOID")

EPA’s EJ Screen Dataset: The 2024 data was downloaded the the EPA’s website. The EJScreen data is a national dataset that combines environmental indicators with socioeconomic variables. The data can be used by researchers to inform policy related to environmental issues. In this study, we use demographic variables and justice indices related to lead. These indices combine proxies for exposures to lead paint, racial and income level variables.

#Load the data form csv
ejscreen <- read.csv("C:\\Users\\atrocle\\Documents\\.EPID 6000 Data Science\\Assignments\\Final Project\\EJScreen_2024_Tract.csv", header = TRUE)

#Select lead related and demographic variables and rename
ejscreen_limit <- ejscreen|>
  select( GEOID = ID,      
    State = STATE_NAME,   
    County_name = CNTY_NAME,
    Region = REGION,                  
    Demographic_Index = DEMOGIDX_2,   
    Supplemental_Demographic_Index = DEMOGIDX_5, 
    Percent_POC = PEOPCOLORPCT,      
    Percent_Low_Income = LOWINCPCT,  
    Percent_Unemployed = UNEMPPCT,    
    Percent_Disabled = DISABILITYPCT, 
    Percent_Limited_English = LINGISOPCT, 
    Percent_No_HS_Education = LESSHSPCT, 
    Percent_Under5 = UNDER5PCT,      
    Percent_Over64 = OVER64PCT,      
    Percent_Low_Life_Expectancy = LIFEEXPPCT, 
    Lead_Paint_Percent = PRE1960PCT,  
    Lead_Paint_EJ_Index = D2_LDPNT,   
    Lead_Paint_Supp_Index = D5_LDPNT  
  )

#Combine EJJ with Philly and polygon data
ejscreen_phila <- ejscreen_limit|>
  filter(State== "PENNSYLVANIA" & County_name =="Philadelphia County") |>
  mutate(GEOID = as.character(GEOID))

#add to OpenDataPhilly Lead Data
phllead_geo <- phllead_geo |>
  left_join(ejscreen_phila, by = "GEOID")
names(phllead_geo)
 [1] "STATEFP"                        "COUNTYFP"                      
 [3] "TRACTCE"                        "GEOID"                         
 [5] "NAME.x"                         "NAMELSAD"                      
 [7] "MTFCC"                          "FUNCSTAT"                      
 [9] "ALAND"                          "AWATER"                        
[11] "INTPTLAT"                       "INTPTLON"                      
[13] "data_redacted"                  "num_bll_5plus"                 
[15] "num_screen"                     "perc_5plus"                    
[17] "NAME.y"                         "B25035_001E"                   
[19] "B25035_001M"                    "B25036_001E"                   
[21] "B25036_001M"                    "B25036_002E"                   
[23] "B25036_002M"                    "B25036_013E"                   
[25] "B25036_013M"                    "B25034_001E"                   
[27] "B25034_001M"                    "B25034_002E"                   
[29] "B25034_002M"                    "B25034_003E"                   
[31] "B25034_003M"                    "B25034_004E"                   
[33] "B25034_004M"                    "B25034_005E"                   
[35] "B25034_005M"                    "B25034_006E"                   
[37] "B25034_006M"                    "B25034_007E"                   
[39] "B25034_007M"                    "B25034_008E"                   
[41] "B25034_008M"                    "B25034_009E"                   
[43] "B25034_009M"                    "B25034_010E"                   
[45] "B25034_010M"                    "B25034_011E"                   
[47] "B25034_011M"                    "Built_after_1980"              
[49] "Built_before_1979"              "Percent_after_1980"            
[51] "Percent_before_1979"            "Percent_Owner_Occupied"        
[53] "Percent_Renter_Occupied"        "State"                         
[55] "County_name"                    "Region"                        
[57] "Demographic_Index"              "Supplemental_Demographic_Index"
[59] "Percent_POC"                    "Percent_Low_Income"            
[61] "Percent_Unemployed"             "Percent_Disabled"              
[63] "Percent_Limited_English"        "Percent_No_HS_Education"       
[65] "Percent_Under5"                 "Percent_Over64"                
[67] "Percent_Low_Life_Expectancy"    "Lead_Paint_Percent"            
[69] "Lead_Paint_EJ_Index"            "Lead_Paint_Supp_Index"         
[71] "geometry"                      

Themes: Finally, themes for maps are loaded.

#Create a theme for maps to be used in future maps
myPalette <- colorRampPalette(brewer.pal(9, "BuPu"))  
map_theme <- function() {
  theme_minimal() + 
  theme(axis.line = element_blank(), 
        axis.text = element_blank(),  
        axis.title = element_blank(),
        panel.grid = element_line(color = "white"),  
        legend.key.size = unit(0.8, "cm"),           
        legend.text = element_text(size = 12),       
        legend.title = element_text(size = 12))}

4 Results

4.1 National Lead Data:

We first analyze national data from NHANES to gain a broader understanding of blood lead levels (BLL) across the United States before narrowing our focus to Philadelphia.

4.1.1 Exploratory Analysis of BLLs in Children

We initally take a look at the housing stock nationally to identify areas of concern based on the prevalence of lead paint in older homes. This analysis shows a concentration of housing build before 1979 in the Midwest and Northeast, reinforcing the need to explore localized data in places like Philadelphia.

pal_fun <- colorNumeric("BuPu", NULL)   
pu_message <- paste0(us.counties$NAME.y,
                     "<br> Percent Built before 1979: ",       
                     round(us.counties$Percent_before_1979), "%")

leaflet(us.counties) |>
  addProviderTiles(providers$CartoDB.Positron)  |>
  addPolygons(stroke = TRUE,
              color = "gray",    
              weight = 0.4,
              fillColor = ~ pal_fun(Percent_before_1979),
              fillOpacity = 0.8, smoothFactor = 0.5, 
              popup = pu_message) |>  
                addTiles()|>
                  addLegend("bottomright",
                    pal = pal_fun,  
                    values = ~ Percent_before_1979,
                    title = 'Built Before 1979 (%)',  
                    opacity = 1) |>              
  addScaleBar(position ="bottomleft")

Next, we analyze elevated blood lead levels (BLLs) across various variables, starting with gender in the 2015-2016 NHANES data. The results reveal an equal distribution of elevated BLL cases between genders.

elevatedgender <- nhanes.data.bll |> 
  filter(elevated.bll == "elevated", survey == "2015-2016") |> 
  group_by(survey, gender) |> 
  summarize(count = n(), .groups = 'drop')

ggplot(elevatedgender, aes(x = factor(survey), y = count, fill = as.factor(gender))) +
  geom_bar(stat = "identity", position = "dodge") + 
  labs(
    title = "Number of Elevated Blood Lead Levels by Survey Year and Gender",
    x = "Survey Years",
    y = "Number of Elevated BLL Cases",
    fill = "Gender"
  ) +
   scale_fill_brewer(palette = "Set2") +
  theme_minimal()

Next, in this chart of elevated BLL in 2015-2016 by race, we see that the greatest number of children with elevated BLL is in white children. Black children account for about half the number of cases compared with white children. Though Hispanic children have fewer counts, combining this with the number of Mexican American children suggests there are similar counts in Hispanic children as Black children. Considering the overall racial population distribution, this suggests that Black and Hispanic children may be over-represented in those with elevated BLLs.

elevatedrace <- nhanes.data.bll |> 
  filter(elevated.bll == "elevated", survey == "2015-2016") |>  # Filter for 2015-2016 and elevated BLL
  group_by(survey, race) |> 
  summarize(count = n(), .groups = 'drop')

ggplot(elevatedrace, aes(x = factor(survey), y = count, fill = as.factor(race))) + 
  geom_bar(stat = "identity", position = "dodge") + 
  labs(
    title = "Number of Elevated Blood Lead Levels in 2015-2016 by Race/Ethnicity",
    x = "Survey Year",
    y = "Number of Elevated BLL Cases",
    fill = "Race/Ethnicity"
  ) +
  scale_fill_brewer(palette = "Set2") + 
  theme_minimal()

Last, we look at the age breakdown. Here we see that the counts of children with elevated BLLs generally increases with decreasing age. This may be related to younger childrens proclivity to ingesting objects, like lead paint and lead paint dust.

elevatedage <- nhanes.data.bll |> 
  filter(elevated.bll == "elevated", 
         survey == "2015-2016", 
         !is.na(agecat)) |>  # Remove rows where agecat is NA
  group_by(survey, agecat) |> 
  summarize(count = n(), .groups = 'drop')

ggplot(elevatedage, aes(x = factor(survey), y = count, fill = as.factor(agecat))) + 
  geom_bar(stat = "identity", position = "dodge") + 
  labs(
    title = "Number of Elevated Blood Lead Levels by Survey Year and Age",
    x = "Survey Year",
    y = "Number of Elevated BLL Cases",
    fill = "Age (years)"
  ) +
  scale_fill_brewer(palette = "Set2") + 
  theme_minimal()

5 Statistical Analysis

Logistic Regression:

In this portion of the analysis, we first carry out a logistic regression with the unweighted data. Using a dicotomized outcome of elevated or not elevated BLLs, we look at 5 potential predictors: gender age, race, education of the head of household, and level of assets available to a family. Assets are used a proxy to understand a families ability to complete lead remediation without outside funding. Both the 2015-2016 and 2021-2023 data is used below.

logisticmodel <- glm(
  elevated.bll ~ gender + agecat + race + hh_education + assets,
  data = nhanes.data.bll,
  family = binomial(link = "logit")
)
summary(logisticmodel)

Call:
glm(formula = elevated.bll ~ gender + agecat + race + hh_education + 
    assets, family = binomial(link = "logit"), data = nhanes.data.bll)

Coefficients:
                                        Estimate Std. Error z value Pr(>|z|)   
(Intercept)                              -2.2045     0.8364  -2.636  0.00839 **
genderFemale                              0.3688     0.6332   0.582  0.56032   
agecat2-<3                               -0.8067     0.7854  -1.027  0.30439   
agecat3-<4                               -1.3993     1.1433  -1.224  0.22099   
agecat4-<5                               -1.8900     1.1281  -1.675  0.09387 . 
agecat5-<6                              -19.1037  3811.8662  -0.005  0.99600   
agecat7-<8                               -1.1933     0.8831  -1.351  0.17658   
agecat8-<9                              -19.0167  3349.5517  -0.006  0.99547   
agecat9-<10                             -19.1629  3225.1094  -0.006  0.99526   
raceOther Hispanic                        0.0295     1.2000   0.025  0.98039   
raceNon-Hispanic White                    1.5256     0.9722   1.569  0.11658   
raceNon-Hispanic Black                    1.4157     0.9751   1.452  0.14653   
raceNon-Hispanic Asian                  -16.0432  5820.6066  -0.003  0.99780   
raceOther Race - Inc. Multi-Racial        1.0454     1.2817   0.816  0.41470   
hh_educationHigh school/ some college    -1.1074     0.8945  -1.238  0.21570   
hh_educationCollege graduate or above    -1.5273     0.8914  -1.713  0.08664 . 
assets$3001-$5000                       -17.6674  4838.7718  -0.004  0.99709   
assets$5001-$10000                      -17.3128  6144.8223  -0.003  0.99775   
assets$10001-$15000                     -18.1140  9994.3711  -0.002  0.99855   
assets$15001-$20000                       1.1825 29423.7137   0.000  0.99997   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 113.608  on 507  degrees of freedom
Residual deviance:  88.344  on 488  degrees of freedom
  (1612 observations deleted due to missingness)
AIC: 128.34

Number of Fisher Scoring iterations: 20

As most values are not significant at a p<0.05 level, there are limitation to the interpretations to this analysis.

Adjusted Analysis:

Next, we look at the adjusted data summarized as a table for both survey years. We then carry out adjusted logistic regression models for the 2015-2016 and 2021-2023 data, separately.

#Table 1: 2015-2016 NHANES Data 
data1516 <- subset(nhanes_design, survey == "2015-2016")
vars <- c("gender", "agecat", "race", "hh_education", "assets")
cat_vars <- c("gender", "agecat", "race", "hh_education", "assets")

# Create Table 1
table1_stratified1 <- svyCreateTableOne(
  vars = vars, 
  strata = "elevated.bll",  # Stratify by elevated BLL status
  data = data1516, 
  factorVars = cat_vars
)
table1_stratified1
                                   Stratified by elevated.bll
                                    not elevated       elevated         p     
  n                                 12922835.5         218732.6               
  gender = Female (%)                6250662.9 (48.4)  112620.2 (51.5)   0.738
  agecat (%)                                                             0.025
     1-<2                            1088785.3 ( 9.5)   84648.8 (38.7)        
     2-<3                            1361561.7 (11.9)   47906.3 (21.9)        
     3-<4                            1243140.9 (10.9)   40093.2 (18.3)        
     4-<5                            1503742.9 (13.1)   24775.4 (11.3)        
     5-<6                            1382000.3 (12.1)   10416.3 ( 4.8)        
     7-<8                            1615963.1 (14.1)   10892.7 ( 5.0)        
     8-<9                            1732519.3 (15.1)       0.0 ( 0.0)        
     9-<10                           1527767.1 (13.3)       0.0 ( 0.0)        
  race (%)                                                               0.542
     Mexican American                2428174.6 (18.8)   29364.3 (13.4)        
     Other Hispanic                  1365198.2 (10.6)   10668.2 ( 4.9)        
     Non-Hispanic White              6060172.0 (46.9)  133200.2 (60.9)        
     Non-Hispanic Black              1749684.5 (13.5)   34820.4 (15.9)        
     Non-Hispanic Asian               478872.6 ( 3.7)    5906.3 ( 2.7)        
     Other Race - Inc. Multi-Racial   840733.7 ( 6.5)    4773.4 ( 2.2)        
  hh_education (%)                                                       0.116
     Less than high school degree    1063290.1 (22.5)   61704.6 (39.8)        
     High school/ some college       1389424.0 (29.3)   55227.4 (35.6)        
     College graduate or above       2281447.5 (48.2)   38112.3 (24.6)        
  assets (%)                                                             0.793
     Less than $3000                 6660215.8 (75.6)  112081.6 (88.9)        
     $3001-$5000                      870796.0 ( 9.9)    8573.9 ( 6.8)        
     $5001-$10000                     698747.1 ( 7.9)    5368.1 ( 4.3)        
     $10001-$15000                    353874.7 ( 4.0)       0.0 ( 0.0)        
     $15001-$20000                    227909.3 ( 2.6)       0.0 ( 0.0)        
                                   Stratified by elevated.bll
                                    test
  n                                     
  gender = Female (%)                   
  agecat (%)                            
     1-<2                               
     2-<3                               
     3-<4                               
     4-<5                               
     5-<6                               
     7-<8                               
     8-<9                               
     9-<10                              
  race (%)                              
     Mexican American                   
     Other Hispanic                     
     Non-Hispanic White                 
     Non-Hispanic Black                 
     Non-Hispanic Asian                 
     Other Race - Inc. Multi-Racial     
  hh_education (%)                      
     Less than high school degree       
     High school/ some college          
     College graduate or above          
  assets (%)                            
     Less than $3000                    
     $3001-$5000                        
     $5001-$10000                       
     $10001-$15000                      
     $15001-$20000                      
#Table 2: 2021-2023 NHANES Data 
data2123 <- subset(nhanes_design, survey == "2021-2023")
vars2 <- c("gender", "agecat", "race", "assets") #Head of household education is removed from this table
cat_vars2 <- c("gender", "agecat", "race",  "assets")

# Create Table 2
table1_stratified2 <- svyCreateTableOne(
  vars = vars2, 
  strata = "elevated.bll",  # Stratify by elevated BLL status
  data = data2123, 
  factorVars = cat_vars2
)
table1_stratified2
                                   Stratified by elevated.bll
                                    not elevated      elevated        p     
  n                                 9137170.0         75084.0               
  gender = Female (%)               4158007.2 (45.5)  47925.9 (63.8)   0.495
  agecat (%)                                                           0.162
     1-<2                            628447.8 ( 7.7)  19931.3 (26.5)        
     2-<3                            641926.2 ( 7.9)   7226.7 ( 9.6)        
     3-<4                            702386.6 ( 8.6)  35050.7 (46.7)        
     4-<5                            988006.8 (12.1)      0.0 ( 0.0)        
     5-<6                           1240417.8 (15.2)      0.0 ( 0.0)        
     7-<8                           1156369.8 (14.2)  12875.2 (17.1)        
     8-<9                           1375547.5 (16.8)      0.0 ( 0.0)        
     9-<10                          1438962.0 (17.6)      0.0 ( 0.0)        
  race (%)                                                             0.232
     Mexican American                881470.2 ( 9.6)      0.0 ( 0.0)        
     Other Hispanic                 1446488.9 (15.8)   7226.7 ( 9.6)        
     Non-Hispanic White             4328299.6 (47.4)  19931.3 (26.5)        
     Non-Hispanic Black             1083354.6 (11.9)      0.0 ( 0.0)        
     Non-Hispanic Asian              608566.2 ( 6.7)  35050.7 (46.7)        
     Other Race - Inc. Multi-Racial  788990.4 ( 8.6)  12875.2 (17.1)        
  assets (%)                                                           0.780
     Less than $3000                3210286.3 (71.9)  42277.4 (68.0)        
     $3001-$5000                     485492.3 (10.9)  19931.3 (32.0)        
     $5001-$10000                    292476.1 ( 6.6)      0.0 ( 0.0)        
     $10001-$15000                   307288.7 ( 6.9)      0.0 ( 0.0)        
     $15001-$20000                   166482.6 ( 3.7)      0.0 ( 0.0)        
                                   Stratified by elevated.bll
                                    test
  n                                     
  gender = Female (%)                   
  agecat (%)                            
     1-<2                               
     2-<3                               
     3-<4                               
     4-<5                               
     5-<6                               
     7-<8                               
     8-<9                               
     9-<10                              
  race (%)                              
     Mexican American                   
     Other Hispanic                     
     Non-Hispanic White                 
     Non-Hispanic Black                 
     Non-Hispanic Asian                 
     Other Race - Inc. Multi-Racial     
  assets (%)                            
     Less than $3000                    
     $3001-$5000                        
     $5001-$10000                       
     $10001-$15000                      
     $15001-$20000                      

From these outputs, we see that there is significant difference in age (2015-2016 data) and between genders (2021-2023). Though statistically insignificant, we see interested tends across the other variables. We also see variations between the two survey years.

Adjusted Logistic Regression:

Lastly, we carry out a logistic regression and calculate odds ratios for the adjusted model to consider the association between each predictor and outcome.

#Logistic Model for 2015-2016 Data
logit_model <- svyglm(elevated.bll ~ gender + agecat + race + poverty_level, 
                      design = data1516, 
                      family = quasibinomial())
summary(logit_model)

Call:
svyglm(formula = elevated.bll ~ gender + agecat + race + poverty_level, 
    design = data1516, family = quasibinomial())

Survey design:
subset(nhanes_design, survey == "2015-2016")

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)
(Intercept)                         -2.9965     0.5966  -5.023      NaN
genderFemale                         0.2479     0.3914   0.633      NaN
agecat2-<3                          -0.6938     0.4061  -1.709      NaN
agecat3-<4                          -0.6974     0.4990  -1.397      NaN
agecat4-<5                          -1.8535     0.7458  -2.485      NaN
agecat5-<6                          -2.1599     0.9942  -2.173      NaN
agecat7-<8                          -2.2601     0.4882  -4.629      NaN
agecat8-<9                         -18.0007     0.4639 -38.801      NaN
agecat9-<10                        -18.0394     0.5600 -32.210      NaN
raceOther Hispanic                  -0.3875     0.9289  -0.417      NaN
raceNon-Hispanic White               0.5349     0.6895   0.776      NaN
raceNon-Hispanic Black               0.2160     0.7326   0.295      NaN
raceNon-Hispanic Asian               0.4241     1.1138   0.381      NaN
raceOther Race - Inc. Multi-Racial  -0.9386     1.0532  -0.891      NaN
poverty_level1.31 - 1.85             0.4486     0.9150   0.490      NaN
poverty_level > 1.85                -0.2579     0.4736  -0.544      NaN

Zero or negative residual df; p-values not defined

(Dispersion parameter for quasibinomial family taken to be 0.8900888)

Number of Fisher Scoring iterations: 19
logit_model_summary <- summary(logit_model)
exp_coef <- exp(coef(logit_model))
conf_int <- exp(confint(logit_model))

# Combine ORs and CIs
or_table <- data.frame(
  Variable = names(exp_coef),
  Odds_Ratio = exp_coef,
  Lower_CI = conf_int[, 1],
  Upper_CI = conf_int[, 2]
)

#Visualize the ORs
ggplot(or_table, aes(x = Variable, y = Odds_Ratio, ymin = Lower_CI, ymax = Upper_CI)) +
  geom_pointrange(color = "purple", size = 1) +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
  coord_flip() +
  labs(
    title = "Odds Ratios",
    x = "Predictors",
    y = "Odds Ratio"
  ) +
  theme_minimal()

#Logistic Model for 2021-2023 Data
logit_model2 <- svyglm(elevated.bll ~ gender + agecat + race + poverty_level, 
                      design = data2123, 
                      family = quasibinomial())
summary(logit_model2)

Call:
svyglm(formula = elevated.bll ~ gender + agecat + race + poverty_level, 
    design = data2123, family = quasibinomial())

Survey design:
subset(nhanes_design, survey == "2021-2023")

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)
(Intercept)                          -1.468      1.620  -0.907      NaN
genderFemale                        -19.118      1.195 -15.994      NaN
agecat2-<3                          -38.754      3.009 -12.881      NaN
agecat3-<4                          -36.528      1.735 -21.056      NaN
agecat4-<5                          -58.126      2.604 -22.325      NaN
agecat5-<6                          -60.254      2.506 -24.043      NaN
agecat7-<8                          -60.339      1.702 -35.448      NaN
agecat8-<9                          -60.258      2.535 -23.772      NaN
agecat9-<10                         -60.878      2.737 -22.241      NaN
raceOther Hispanic                   37.142      2.263  16.413      NaN
raceNon-Hispanic White               18.608      1.533  12.135      NaN
raceNon-Hispanic Black              -19.007      1.906  -9.974      NaN
raceNon-Hispanic Asian               58.823      2.027  29.014      NaN
raceOther Race - Inc. Multi-Racial   -1.969      1.724  -1.142      NaN
poverty_level1.31 - 1.85            -39.261      1.536 -25.564      NaN
poverty_level > 1.85                -18.864      1.145 -16.474      NaN

Zero or negative residual df; p-values not defined

(Dispersion parameter for quasibinomial family taken to be 0.04202086)

Number of Fisher Scoring iterations: 24
logit_model_summary2 <- summary(logit_model2)
exp_coef2 <- exp(coef(logit_model2))
conf_int2 <- exp(confint(logit_model2))

# Combine ORs and CIs
or_table2 <- data.frame(
  Variable2 = names(exp_coef2),
  Odds_Ratio2 = exp_coef2,
  Lower_CI2 = conf_int2[, 1],
  Upper_CI2 = conf_int2[, 2]
)

or_table2
                                                            Variable2
(Intercept)                                               (Intercept)
genderFemale                                             genderFemale
agecat2-<3                                                 agecat2-<3
agecat3-<4                                                 agecat3-<4
agecat4-<5                                                 agecat4-<5
agecat5-<6                                                 agecat5-<6
agecat7-<8                                                 agecat7-<8
agecat8-<9                                                 agecat8-<9
agecat9-<10                                               agecat9-<10
raceOther Hispanic                                 raceOther Hispanic
raceNon-Hispanic White                         raceNon-Hispanic White
raceNon-Hispanic Black                         raceNon-Hispanic Black
raceNon-Hispanic Asian                         raceNon-Hispanic Asian
raceOther Race - Inc. Multi-Racial raceOther Race - Inc. Multi-Racial
poverty_level1.31 - 1.85                     poverty_level1.31 - 1.85
poverty_level > 1.85                             poverty_level > 1.85
                                    Odds_Ratio2 Lower_CI2 Upper_CI2
(Intercept)                        2.303027e-01       NaN       NaN
genderFemale                       4.980107e-09       NaN       NaN
agecat2-<3                         1.477557e-17       NaN       NaN
agecat3-<4                         1.368010e-16       NaN       NaN
agecat4-<5                         5.703726e-26       NaN       NaN
agecat5-<6                         6.794472e-27       NaN       NaN
agecat7-<8                         6.241848e-27       NaN       NaN
agecat8-<9                         6.762334e-27       NaN       NaN
agecat9-<10                        3.641009e-27       NaN       NaN
raceOther Hispanic                 1.351269e+16       NaN       NaN
raceNon-Hispanic White             1.206506e+08       NaN       NaN
raceNon-Hispanic Black             5.564499e-09       NaN       NaN
raceNon-Hispanic Asian             3.521041e+25       NaN       NaN
raceOther Race - Inc. Multi-Racial 1.395647e-01       NaN       NaN
poverty_level1.31 - 1.85           8.896328e-18       NaN       NaN
poverty_level > 1.85               6.416913e-09       NaN       NaN

In both models, we see extreme estimate values which suggests that the models may not accurately convey associations. We do not see strong association between the predictors and outcome based on the 2015-2016 adjusted logistic regression model. There is some suggestion that race has a signification association but is likely due to missingness - this should be addressed in future studies for both survey years.

5.1 Local:

In the second portion of the analysis, we use the OpenDataPhilly blood lead levels to understand lead exposures at a local level. We also utilize national data form the ACS and EPA to assess the data.

First, we summarize the data.

#Understand how many tracts are redacted
table(phllead_geo$data_redacted)

false  true 
  236   123 
#Histogram of the distribuion of elevated BLL %
ggplot(phllead_geo, aes(x = perc_5plus)) + 
  geom_histogram(binwidth = 2) + 
  labs(title = "Distribution of % Children with Elevated BLL") 

We see that while many census tracts have less than 5% of children with elevated BLLs, the data is skewed suggesting that there will be obvious areas to target for lead remediation.

Next, we use spatial data to better understand which areas might be most important to consider for targeted interventions. Interpretations are included within the chunk.

#Change the CRS
phllead_geo2 <- st_transform(phllead_geo, crs = 4326)

#Static Map of Children Under 5
ggplot() +
  geom_sf(data = phllead_geo2, aes(fill = Percent_Under5))+
   map_theme() + 
  ggtitle("Tract-level % of Population Under 5 yrs in Philadelphia") + 
  scale_fill_gradientn(name = "% Population (%)",   
                    colours = myPalette(100), 
                       limits = c(0, 0.15),
                    labels = scales::percent_format(accuracy = 1))

#There are no obvious patterns when looking at where children under 5 live

#Static Map of EJ Index
ggplot() +
  geom_sf(data = phllead_geo2, aes(fill = Lead_Paint_EJ_Index)) +  
  map_theme() + 
  ggtitle("Lead Paint Environmental Justice Index in Philadelphia") +  
  scale_fill_gradientn(name = "Lead Paint Environmental Justice Index", 
                       colours = myPalette(100))

#This shows that there might be clear areas where lead paint is most problematic. As this index considers areas where people of color live and low income areas, this therefore provides direction on areas that might be disenfranchised.  

#Static Map of Housing build before 1979 
ggplot() +
  geom_sf(data = phllead_geo2, aes(fill = Percent_before_1979)) +  
  map_theme() + 
  ggtitle("% Housing Built before 1979 in Philadelphia") +  
  scale_fill_gradientn(name = "Percent of Homes", 
                       colours = myPalette(100))

#This shows us that mostly homes across Philly were build before 1979 and the lead paint ban in 1978. So this is not as helpful for targeted policy creation.


ggplot() +
  geom_sf(data = phllead_geo2, aes(fill = Percent_Renter_Occupied)) +  
  map_theme() + 
  ggtitle("% Renter Occupied in Philadelphia") +  
  scale_fill_gradientn(name = "Percent of Homes", 
                       colours = myPalette(100))

#Since the 2011 and 2019 laws are focused on rental properties, it is important to understand where these properties are concentrated

ggplot() +
  geom_sf(data = phllead_geo2, aes(fill = Percent_Owner_Occupied)) +  
  map_theme() + 
  ggtitle("% Owner Occupied in Philadelphia") +  
  scale_fill_gradientn(name = "Percent of Homes", 
                       colours = myPalette(100))

#Conversely, it is helpful to understand where the owned properties are to create interventions that target this group

#Means of Occupancy and Year Build Variables
averages <- data.frame(
  Group = c("Occupancy", "Occupancy", "Year Built", "Year Built"),
  Category = c("Owner Occupied", "Renter Occupied", "Built 1980-Now", "Built Before 1979"),
  Percentage = c(
    mean(phllead_geo2$Percent_Owner_Occupied, na.rm = TRUE),
    mean(phllead_geo2$Percent_Renter_Occupied, na.rm = TRUE),
    mean(phllead_geo2$Percent_after_1980, na.rm = TRUE),
    mean(phllead_geo2$Percent_before_1979, na.rm = TRUE)
  )
)

# Stacked bar chart
ggplot(averages, aes(x = Group, y = Percentage, fill = Category)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(
    title = "Average Percentages Across Tracts",
    x = NULL,
    y = "Average Percentage (%)",
    fill = "Category"
  ) +
  theme_minimal()

#this confirms that most properties were build before 1979. Rental and owned properties are split relatively evenly. It is therefore important to consider spatial concentations.

From these maps, we see that North and Southwest Philadelphia might be areas to focus interventions. We can now directly analyze the BLLs to make more definitive extrapolations. We will be able to validate out hypothesis about where interventions should take place and identify areas most at risk.

#Static Map of Elevated BLL %
ggplot() +
  geom_sf(data = phllead_geo2, aes(fill = perc_5plus))+
   map_theme() + 
  ggtitle("Tract-level % elevated BLL in Philadelphia") + 
  scale_fill_gradientn(name = "% Elevated BLL (%)",   
                    colours = myPalette(100)) 

phllead_geo2 <- phllead_geo2 |>
  mutate(perc_5plus_bins = cut(perc_5plus,
                               breaks = c(0, 2, 4, 6, 8, 10, Inf),
                               labels = c("0-2%", "2-4%", "4-6%", "6-8%", "8-10%", ">10%"),
                               include.lowest = TRUE))

# Plot with the new bins
ggplot() +
  geom_sf(data = phllead_geo2, aes(fill = perc_5plus_bins)) +
  map_theme() + 
  ggtitle("Tract-level % Elevated BLL in Philadelphia") + 
  scale_fill_manual(name = "% Elevated BLL", 
                    values = c("0-2%" = "lightblue", 
                               "2-4%" = "skyblue", 
                               "4-6%" = "yellow", 
                               "6-8%" = "orange", 
                               "8-10%" = "red", 
                               ">10%" = "darkred")) +
  theme_minimal() +
  theme(axis.text = element_blank(),      
        axis.title = element_blank(),     
        axis.ticks = element_blank())

Our analysis shows hotspots of elevated BLLs in the North and Southwest regions of Philadelphia. These areas also show considerable overlap with the EPA lead index map, indicating a possible correlation between elevated BLL and neighborhoods characterized by older housing stock with lead paint. Based on the lead index, these regions are likely communities of color and individuals from lower-income households, highlighting an intersection between environmental and social disparities.

In areas where elevated BLL does not overlap with the EPA’s lead index, it is necessary to further investigate other sources of lead exposure. These could include contaminated soil, lead pipes, or industrial emissions. Identifying and understanding these exposure pathways will provide a more comprehensive picture of the risks and inform targeted interventions.

This analysis confirms the need to prioritize these hotspots regions for interventions. By focusing resources on these vulnerable areas, we can reduce health disparities associated with elevated BLLs in these communities.

Linear Regression:

Finally, we carry out a linear regression incorportating variables from the EPA data and outcome data from OpenDataPhilly.

#Linear Regression: 
modelphl <- lm(perc_5plus ~ Percent_POC + Percent_Low_Income + Percent_Disabled + Percent_No_HS_Education + Percent_Low_Life_Expectancy, data = phllead_geo2)

summary(modelphl)

Call:
lm(formula = perc_5plus ~ Percent_POC + Percent_Low_Income + 
    Percent_Disabled + Percent_No_HS_Education + Percent_Low_Life_Expectancy, 
    data = phllead_geo2)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.0650 -2.1923 -0.2587  1.5390  8.9950 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   -6.323      1.252  -5.048 9.25e-07 ***
Percent_POC                    4.465      1.096   4.074 6.42e-05 ***
Percent_Low_Income             2.416      1.884   1.283 0.200975    
Percent_Disabled              12.101      3.278   3.691 0.000281 ***
Percent_No_HS_Education       -9.488      2.831  -3.352 0.000943 ***
Percent_Low_Life_Expectancy   29.683      6.193   4.793 3.00e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.002 on 223 degrees of freedom
  (179 observations deleted due to missingness)
Multiple R-squared:  0.4131,    Adjusted R-squared:  0.3999 
F-statistic: 31.39 on 5 and 223 DF,  p-value: < 2.2e-16
# Coefficients
coef_data <- tidy(modelphl, conf.int = TRUE)

# Plot
ggplot(coef_data, aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high)) +
  geom_pointrange() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  coord_flip() +
  labs(title = "Coefficient Estimates with CIs",
       x = "Predictors",
       y = "Estimate") +
  theme_minimal()

This linear regression model looks at the relationship between socioeconomic and demographic factors and the prevalence of elevated BLL in children, using aggregated data of the percent of children with elevated BLL in a census tract. Significant predictors include the percentage of people of color, people with disabilities, and those in areas with low life expectancy. These predictors have positive associations with elevated BLLs, confirming our previous spatial analysis and adding in new considerations. Interestingly, a higher percentage of individuals without a high school education is associated with lower BLL. This should be further investigated in future studies.

The model explains approximately 40% of the variability in elevated BLLs based on the R-squared value. Therefore, this model is likely missing key predictors. However, these findings continue to prove the importance of targeting interventions in vulnerable communities to equitably mitigate lead exposure risks.

6 Strength and Limitations

The study includes both nationally representative and geographically specific data to provide a comprehensive evaluation. This allows for cross referencing of historical and current data in situation where this is not available at a specific geographic level. Limitation include missingness which can be dealt with in future studies with imputation and other weighting measures. Though it is useful to use data in multiple time periods, this also poses some challenges in aligning interpretations. Finally, the data is limited due to redaction. This data constraint limits the ability to assess and impact policy.

7 Conclusion and Future Directions

The visualizations and analyses in this project depict current and historical trends in blood lead levels (BLLs) among children in the United States. There are disparities that remain to be seen, disproportionately affecting marginalized communities, as is evident in the analysis of the Philadelphia level data. It is important to continue this research to better understand how communities have been divested in with the intention of creating policy that addresses this disinvestment.

The findings call for targeted interventions. Areas with the highest burden of elevated BLLs should be prioritized for lead remediation efforts, such as replacing infrastructure with lead, enforcing regulations, and providing education on lead exposure prevention. By addressing lead exposures in children through these efforts, these initiatives will address long term consequences of exposure to lead and improve the health of children in the long term.

Future directions of this involves considering and combining statewide data to provide a comprehensive view of the issues and potentially requesting additional data from Penn/CHOP to supplement the City of Philadelphia data. These direction will aim to inform policy and develop targeted solutions to address lead exposures.

8 Acknowledgements

I would like to thank the following faculty members who were consulted on designing and implementing the research question:

  • Dr John Holmes
  • Dr Aimin Chen
  • Dr Cheryl Bettigole

9 References

Egan KB, Cornwell CR, Courtney JG, Ettinger AS. Blood Lead Levels in U.S. Children Ages 1–11 Years, 1976–2016. Environ Health Perspect. 2021;129(3):037003. doi:10.1289/EHP7932

Gasoline and the environment - leaded gasoline - U.S. Energy Information Administration (EIA). https://www.eia.gov/energyexplained/gasoline/gasoline-and-the-environment-leaded-gasoline.php

Jacobs DE, Brown MJ. Childhood Lead Poisoning 1970-2022: Charting Progress and Needed Reforms. Journal of Public Health Management and Practice. 2022;29(2):230. doi:10.1097/PHH.0000000000001664

Childhood Lead Poisoning Surveillance Report | Programs and initiatives | City of Philadelphia. Accessed December 11, 2024. https://www.phila.gov/programs/lead-and-healthy-homes-program/childhood-lead-poisoning-surveillance-report/

Philadelphia, Pennsylvania | Childhood Lead Poisoning Prevention | CDC. https://www.cdc.gov/lead-prevention/success-stories-by-state/philadelphia-pa.html

About Childhood Lead Poisoning Prevention | Childhood Lead Poisoning Prevention | CDC. Accessed December 11, 2024. https://www.cdc.gov/lead-prevention/about/index.html?CDC_AAref_Val=https://www.cdc.gov/nceh/lead/docs/data/state/2017/PA_CountyLevelSummary_2017-508.pdf

Ahrens KA, Haley BA, Rossen LM, Lloyd PC, Aoki Y. Housing assistance and blood lead levels: Children in the United States, 2005-2012. Am J Public Health. 2016;106(11):2049-2056. doi:10.2105/AJPH.2016.303432

FACT SHEET: Biden-Harris Administration Strengthens Standards to Protect Millions from Exposure to Lead Paint Dust, Announces New Actions to Address Toxic Lead Exposure | The White House. https://www.whitehouse.gov/briefing-room/statements-releases/2024/10/24/fact-sheet-biden-harris-administration-strengthens-standards-to-protect-millions-from-exposure-to-lead-paint-dust-announces-new-actions-to-address-toxic-lead-exposure/

Hoffman-Pennesi D, Winfield S, Gavelek A, Santillana Farakos SM, Spungen J. Infants’ and young children’s dietary exposures to lead and cadmium: FDA total diet study 2018–2020. Food Additives and Contaminants - Part A. Published online 2024. doi:10.1080/19440049.2024.2396910

Bravo MA, Zephyr D, Kowal D, Ensor K, Miranda ML. Racial residential segregation shapes the relationship between early childhood lead exposure and fourth-grade standardized test scores. Proc Natl Acad Sci U S A. 2022;119(34). doi:10.1073/PNAS.2117868119

Miranda ML, Lilienfeld A, Tootoo J, Bravo MA. Segregation and Childhood Blood Lead Levels in North Carolina. Pediatrics. 2023;152(3):e2022058661. doi:10.1542/PEDS.2022-058661